Statements

Acknowledgement
I sincerely thank my parents and family for giving me the support and opportunity to invest my time in learning Machine Learning and Artificial Intelligence for application in environmental management work. Furthermore, I thank the Google Career Certification courses for providing the resources to learn Python programming and Machine Learning concepts.

Use of generative artificial intelligence
Generative artificial intelligence (GenAI) was mainly used for creating charts and adjusting visualization parameters in Python. GenAI was also used for code debugging. However, the responses provided by GenAI were critically judged before being implemented.

Executive Summary

Problem Statement
Salifort Motors is a fictional French-based alternative energy vehicle manufacturer. The HR department at Salifort Motors wants to take some initiatives to improve employee satisfaction levels at the company. They turn to you, a data analytics professional, and ask you to provide data-driven suggestions based on your understanding of the data. They have the following question: what’s likely to make an employee leave the company?

Because it is time-consuming and expensive to find, interview, and hire new employees, increasing employee retention will be beneficial to the company. If the data analyst can predict which employees are likely to quit, it might be possible to identify the main factors that contribute to their leaving.

Project Aim and Focus
Goals in this project are to analyze the data collected by the HR department and to build a model that predicts whether or not an employee will leave the company.

Raw data used
This project uses a dataset called HR_capstone_dataset.csv. It contains 10 columns of self-reported information from employees of a fictitious multinational vehicle manufacturing corporation.

Methodology
The following methodology was undertaken for this project:
- Raw data - HR_capstone_dataset.csv from the HR department is used to assess the needs of the senior leadership team.
- The cleaned data set is split into 75% training and 25% test data, which are used to train and evaluate the machine learning models.
- Analyses such as confusion matrices, feature importance, and scoring metrics are performed to assess the models' performance in predicting whether an employee will leave and to identify the main factors influencing employees to quit.

Results
Out of the models, .

1 Introduction

Salifort Motors is a fictional French-based alternative energy vehicle manufacturer. Its global workforce of over 100,000 employees research, design, construct, validate, and distribute electric, solar, algae, and hydrogen-based vehicles. Salifort’s end-to-end vertical integration model has made it a global leader at the intersection of alternative energy and automobiles.

The HR department at Salifort Motors wants to take some initiatives to improve employee satisfaction levels at the company. They collected data from employees, but now they don’t know what to do with it. They turn to the data analytics professional and ask for data-driven suggestions based on the analyst's understanding of the data. They have the following question: what’s likely to make an employee leave the company?

Because it is time-consuming and expensive to find, interview, and hire new employees, increasing employee retention will be beneficial to the company. If the data analyst can predict which employees are likely to quit, it might be possible to identify the main factors that contribute to their leaving.

2 Aim and Methodology of this Project

For this project, the key stakeholders include the HR department and the senior leadership team, as they are directly involved in employee management and decision-making. The senior leadership team has tasked the data analyst with analyzing the dataset to come up with ideas for how to increase employee retention. To help with this, they would like the analyst to build a machine learning model that predicts whether an employee will leave the company based on their department, number of projects, average monthly hours, and any other data points deemed helpful.

Goals
The primary objective is to identify and predict the underlying drivers contributing to employee turnover, which can help in formulating effective retention strategies. Goals in this project are to analyze the data collected by the HR department and to build a model that predicts whether or not an employee will leave the company.

Methodology
For this project, the analyst chooses a method to approach this data challenge, selecting either a regression model or a tree-based machine learning model to predict whether an employee will leave the company. The following methodology was undertaken for this project:

  • Raw data - HR_capstone_dataset.csv from the HR department is used to assess the needs of the senior leadership team.
  • The cleaned data set is split into 75% training and 25% test data, which are used to train and evaluate the machine learning models.
  • Analyses such as confusion matrices, feature importance, and scoring metrics are performed to assess the models' performance in predicting whether an employee will leave and to identify the main factors influencing employees to quit.
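
As a minimal sketch of the splitting step, the snippet below uses synthetic labels (not the HR data) to show how a stratified split keeps the share of leavers the same in both partitions. The array names and the 20% leaver share are illustrative assumptions; the 75/25 ratio mirrors the `test_size=0.25` used in the modelling code later in this report.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the HR data: 1,000 rows, 20% leavers (assumed values)
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array([1] * 200 + [0] * 800)

# stratify=y_demo keeps the class proportions equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

print(len(X_tr), len(X_te))        # sizes of the two partitions
print(y_tr.mean(), y_te.mean())    # share of leavers in each partition
```

Without `stratify`, the share of leavers in the test set could drift away from the share in the full data purely by chance, which matters when one class is much rarer than the other.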

3 Exploratory Data Analysis - EDA

This project uses a dataset called HR_capstone_dataset.csv, which was downloaded from the Kaggle website. In the EDA, the dataset is analysed and prepared for building the machine learning models. The analysis includes:

  • Loading the required packages and the data set
  • Checking the descriptive statistics
  • Checking for missing values, duplicate values, and outliers
  • Visualizing the relationships within and between the data variables

3.1 Load required libraries

First, the libraries and packages needed for this project are loaded. The selected libraries provide functions for handling data, building and evaluating machine learning models, and visualizing results.


# Import packages
# Operational Packages
import numpy as np
import pandas as pd
import io
import pickle 

# Visualization packages
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import HTML
from IPython.display import display, Markdown
from tabulate import tabulate

# Modelling packages
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

#XGBoost
from xgboost import XGBClassifier
from xgboost import XGBRegressor
from xgboost import plot_importance

# Modelling evaluation and metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score,\
f1_score, confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.tree import plot_tree
from sklearn.tree import export_text

3.2 Data Loading and Pre-processing

3.2.1 Data Loading

To start the project, the dataset HR_capstone_dataset.csv is loaded and its basic structure is examined. The dataset contains 10 columns of self-reported information from employees of a fictitious multinational vehicle manufacturing corporation.

# Load dataset into a dataframe
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Load CSV
df0 = pd.read_csv(r"D:\Study\Machine Learning\Projects\R-Git\Completed projects for GitHub\Predicting-the-employee-satisfaction-levels-at-Salifort-Motors\Data\HR_capstone_dataset.csv")

# Format first 5 rows like a kable table
df0.head().style.set_table_attributes("class='table table-sm'")
  satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years Department salary
0 0.380000 0.530000 2 157 3 0 1 0 sales low
1 0.800000 0.860000 5 262 6 0 1 0 sales medium
2 0.110000 0.880000 7 272 4 0 1 0 sales medium
3 0.720000 0.870000 5 223 5 0 1 0 sales low
4 0.370000 0.520000 2 159 3 0 1 0 sales low

In this step, gaining a comprehensive understanding of the data set and preparing it for modelling is essential. This involves reviewing all variables to understand their data types, statistical distributions, and relevance to the target objective.

# Gather basic information about the data
# Create a StringIO buffer
buffer = io.StringIO()

# Capture the output of df.info() into the buffer
df0.info(buf=buffer)

# Get the content from the buffer
info_str = buffer.getvalue()

# Print the content
display(Markdown(f"```\n{info_str}\n```"))
# Print the descriptive statistics
df0.describe().style.set_table_attributes("class='table table-sm'")
  satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years
count 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000
mean 0.612834 0.716102 3.803054 201.050337 3.498233 0.144610 0.238083 0.021268
std 0.248631 0.171169 1.232592 49.943099 1.460136 0.351719 0.425924 0.144281
min 0.090000 0.360000 2.000000 96.000000 2.000000 0.000000 0.000000 0.000000
25% 0.440000 0.560000 3.000000 156.000000 3.000000 0.000000 0.000000 0.000000
50% 0.640000 0.720000 4.000000 200.000000 3.000000 0.000000 0.000000 0.000000
75% 0.820000 0.870000 5.000000 245.000000 4.000000 0.000000 0.000000 0.000000
max 1.000000 1.000000 7.000000 310.000000 10.000000 1.000000 1.000000 1.000000

The HR_capstone_dataset.csv dataset contains 14,999 rows and 10 columns, of which 2 are floats, 6 are integers, and 2 are objects. Upon initial exploration, most of the variables in the survey data are usable as predictors, but certain variables can be engineered for more effective predictions. An ethical consideration at this point is potential bias in the recorded data, which should be kept in mind both during the analysis and while interpreting and presenting the results, to ensure fairness and accuracy.
Descriptive statistics of the dataset are shown above. Based on these:

  • On average, employees work on ~3.8 projects and ~201 hours per month.
  • Satisfaction levels range from 0.09 to 1 with a mean of ~0.61, while the mean last evaluation score is ~0.72.
  • Most employees have not had a work accident (~14% have) or been promoted in the last 5 years (~2% have).

3.2.2 Data Exploration

In this step, the HR_capstone_dataset.csv dataset is cleaned by addressing missing values, removing redundant or duplicate entries, and identifying any anomalies or inconsistencies. Outliers that could potentially distort model performance are also detected and evaluated for appropriate handling. These steps ensure that the dataset is accurate, consistent, and ready for further analysis, laying a solid foundation for building reliable predictive models.

Rename columns

As a data cleaning step, the columns are renamed as needed: standardizing the column names so that they are all in snake_case, correcting any column names that are misspelled, and making column names more concise.

# Display all column names
list(df0.columns)
## ['satisfaction_level', 'last_evaluation', 'number_project', 'average_montly_hours', 'time_spend_company', 'Work_accident', 'left', 'promotion_last_5years', 'Department', 'salary']
# Rename columns as needed
df = df0.copy()
df = df0.rename(columns={'satisfaction_level':'satisfaction',
                          'last_evaluation':'last_eval',
                          'number_project':'#_projects',
                          'average_montly_hours':'avg_mon_hrs',
                          'time_spend_company':'tenure',
                          'Work_accident':'work_accident',
                          'promotion_last_5years':'promotion_<5yrs',
                         'Department':'department'
                         })


# Display all column names after the update
list(df.columns)
## ['satisfaction', 'last_eval', '#_projects', 'avg_mon_hrs', 'tenure', 'work_accident', 'left', 'promotion_<5yrs', 'department', 'salary']

Check missing values

Checking for any missing values in the data.

# Check for missing values
df.isnull().sum().reset_index().style.set_table_attributes("class='table table-sm'")
  index 0
0 satisfaction 0
1 last_eval 0
2 #_projects 0
3 avg_mon_hrs 0
4 tenure 0
5 work_accident 0
6 left 0
7 promotion_<5yrs 0
8 department 0
9 salary 0

There appear to be no missing values in this dataset.

Check duplicates

Checking for any duplicate entries in the data.

# Check for duplicates
df.duplicated().sum()
## np.int64(3008)
# Inspect some rows containing duplicates as needed
df[df.duplicated()].head().style.set_table_attributes("class='table table-sm'")
  satisfaction last_eval #_projects avg_mon_hrs tenure work_accident left promotion_<5yrs department salary
396 0.460000 0.570000 2 139 3 0 1 0 sales low
866 0.410000 0.460000 2 128 3 0 1 0 accounting low
1317 0.370000 0.510000 2 127 3 0 1 0 sales medium
1368 0.410000 0.520000 2 132 3 0 1 0 RandD low
1461 0.420000 0.530000 2 142 3 0 1 0 sales low
# Drop duplicates and save resulting dataframe in a new variable as needed
df1 = df.drop_duplicates(keep='first')

# Display first few rows of new dataframe as needed
df1.head().style.set_table_attributes("class='table table-sm'")
  satisfaction last_eval #_projects avg_mon_hrs tenure work_accident left promotion_<5yrs department salary
0 0.380000 0.530000 2 157 3 0 1 0 sales low
1 0.800000 0.860000 5 262 6 0 1 0 sales medium
2 0.110000 0.880000 7 272 4 0 1 0 sales medium
3 0.720000 0.870000 5 223 5 0 1 0 sales low
4 0.370000 0.520000 2 159 3 0 1 0 sales low

The duplicated rows match across all 10 columns, several of which are continuous variables, so it is very unlikely that these rows are legitimate, independent observations. They are therefore treated as duplicates and dropped, which should help in making accurate predictions.
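
To make the duplicate check concrete, here is a small sketch on a made-up three-column frame (the values are hypothetical, not taken from the dataset). It illustrates that `duplicated()` flags only the second and later copies of a row, while `keep=False` returns every copy, which can be convenient for inspecting duplicates before dropping them.

```python
import pandas as pd

# Hypothetical toy rows; rows 0 and 2 are identical in every column
toy = pd.DataFrame({
    'satisfaction': [0.46, 0.41, 0.46, 0.80],
    'avg_mon_hrs': [139, 128, 139, 262],
    'left': [1, 1, 1, 0],
})

n_flagged = int(toy.duplicated().sum())        # counts repeats only, not the first copy
all_copies = toy[toy.duplicated(keep=False)]   # every row that has an identical twin
deduped = toy.drop_duplicates(keep='first')    # keeps the first copy of each row

print(n_flagged, len(all_copies), len(deduped))
```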

Check outliers

Checking for outliers in the data. Certain types of models are more sensitive to outliers than others, so whether to remove outliers depends on the type of models that will be used in the project.

# Create a boxplot to visualize distribution of `tenure` and detect any outliers
plt.figure(figsize=(16,6))
plt.title('Detecting outliers for tenure (Boxplot)', fontsize=15)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
sns.boxplot(x=df1['tenure'])
plt.show()

The box plot shows that there are outliers in the tenure column. So, checking how many rows contain outliers in the tenure column.

# Determine the number of rows containing outliers
# 25th Percentile for tenure
percentile25 = df1['tenure'].quantile(0.25)

# 75th Percentile for tenure
percentile75 = df1['tenure'].quantile(0.75)

# IQR - Inter Quartile Range
iqr = percentile75 - percentile25

# Limits of the tenure
upper_limit = percentile75 + 1.5 * iqr
lower_limit = percentile25 - 1.5 * iqr
print('Lower limit:', lower_limit)
## Lower limit: 1.5
print('Upper limit:', upper_limit)
## Upper limit: 5.5
# Identifying the outliers in 'tenure'
outliers = df1[(df1['tenure'] > upper_limit) | (df1['tenure'] < lower_limit)]

# print the rows containing the outliers
print(f'Number of rows containing outliers in tenure:', len(outliers))
## Number of rows containing outliers in tenure: 824

The IQR rule flags tenure values below 1.5 years or above 5.5 years, and 824 rows fall outside these limits. Since the minimum tenure in the data is 2 years, all flagged rows belong to long-tenure employees (6 years or more). Most employees tend to leave within 5 years, possibly due to lack of advancement opportunities.
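
The limits of 1.5 and 5.5 are what the 1.5 × IQR rule gives when the first and third quartiles are 3 and 4. The sketch below reproduces that arithmetic on a small, invented tenure series; `iqr_limits` is a hypothetical helper written for this illustration, not part of the project code.

```python
import pandas as pd

def iqr_limits(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences using the k * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical tenure values chosen so that Q1 = 3 and Q3 = 4
toy_tenure = pd.Series([2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 6, 10])

lower, upper = iqr_limits(toy_tenure)
flagged = toy_tenure[(toy_tenure < lower) | (toy_tenure > upper)]

print(lower, upper)        # the two fences
print(flagged.tolist())    # values outside the fences
```

Because tenure is recorded in whole years, an upper fence of 5.5 is equivalent to flagging every employee with 6 or more years at the company.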

3.2.3 Data Visualization

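The first step is to understand how many employees left and what percentage of all employees this figure represents. Note that the counts below are computed on df, that is, before the duplicate rows were dropped.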

# Get numbers of people who left vs. stayed
print(df['left'].value_counts())
## left
## 0    11428
## 1     3571
## Name: count, dtype: int64
print()
# Get percentages of people who left vs. stayed
df['left'].value_counts(normalize=True)
## left
## 0    0.761917
## 1    0.238083
## Name: proportion, dtype: float64

Next, the variables most relevant to the project are examined, and plots are created to visualize the relationships between variables in the data:

  • Correlation heat maps;
  • Tenure vs satisfaction; tenure vs left distribution;
  • #_project vs avg_mon_hrs;
  • distribution of #_projects;
  • satisfaction vs salary; satisfaction vs avg_mon_hrs;
  • avg_mon_hrs vs last_eval; avg_mon_hrs vs promotion_<5yrs;
  • distribution of left;
# Select only numeric columns
numeric_df = df1.select_dtypes(include=['number'])

# Plot a correlation heatmap
plt.figure(figsize=(20, 12))
heatmap = sns.heatmap(
    numeric_df.corr(),
    vmin=-1,
    vmax=1,
    annot=True,
    fmt=".2f",  # optional: format annotation
    annot_kws={"size": 12},  # ← font size of annotation inside heatmap
    cmap=sns.color_palette("vlag", as_cmap=True),
    cbar_kws={"shrink": 0.75, "label": "Correlation"}  # optional: color bar label
)
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':20}, pad=20);

# Set x and y tick labels size
heatmap.set_xticklabels(heatmap.get_xticklabels(), rotation=45, ha='right', fontsize=14)
heatmap.set_yticklabels(heatmap.get_yticklabels(), fontsize=14)
plt.tight_layout()
plt.show()

Correlation heatmap

  • Positive correlations: Number of projects, average monthly hours, and evaluation scores are more strongly positively correlated with each other than with the remaining variables (correlation > 0.1).
  • Negative correlation: Whether or not an employee leaves is negatively correlated with their satisfaction level.
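
A heatmap can be complemented by a sorted list of each feature's correlation with the target. The following is only a sketch on invented toy values (not the HR data), showing one way such a ranking could be produced with pandas.

```python
import pandas as pd

# Hypothetical toy data: lower satisfaction goes with leaving
toy = pd.DataFrame({
    'satisfaction': [0.90, 0.80, 0.70, 0.20, 0.10, 0.30],
    'avg_mon_hrs': [160, 170, 180, 270, 280, 150],
    'left': [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of each feature with the target, most negative first
corr_with_left = toy.corr()['left'].drop('left').sort_values()
print(corr_with_left)
```
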
# Plots to analyse tenure vs satisfaction; tenure vs left distribution
# Set figure and axes
fig, ax = plt.subplots(1, 2, figsize = (20,8))

# Tenure vs left distribution
tenure_stay = df1[df1['left']==0]['tenure']
tenure_left = df1[df1['left']==1]['tenure']
sns.histplot(data=df1, x='tenure', hue='left', multiple='dodge', shrink=5, ax=ax[0])
ax[0].set_title('Tenure distribution classified by employee who left', fontsize=14)


# Tenure vs Satisfaction
sns.boxplot(data=df1, x='satisfaction', y='tenure', hue='left', orient="h", saturation=0.75, ax=ax[1])
ax[1].legend(loc='upper left', title='Left')
ax[1].invert_yaxis()
ax[1].set_title('Satisfaction vs Tenure', fontsize=14)

plt.show()

Box Plot

  • Satisfaction levels are similar for early- and long-tenure employees
  • Short-tenure employees who left show high dissatisfaction, while medium-tenure employees who stayed show high satisfaction
  • Medium-tenure (4-year) employees who left show a very low satisfaction level

Histogram Plot

The histogram shows that only a few people stay for more than 5 years, which might be due to promotions to higher ranks in the company.

# plot for #_project vs avg_mon_hrs; distribution of #_projects 
fig, ax = plt.subplots(1, 2, figsize = (20,8))

# distribution of #_projects
projects_stay = df1[df1['left']==0]['#_projects']
projects_left = df1[df1['left']==1]['#_projects']
sns.histplot(data=df1, x='#_projects', hue='left', multiple='dodge', shrink=5, ax=ax[0])
ax[0].set_title('No of projects distribution classified by employee who left', fontsize=14)


# #_project vs avg_mon_hrs
sns.boxplot(data=df1, x='avg_mon_hrs', y='#_projects', hue='left', orient="h",saturation=0.75, ax=ax[1])
ax[1].legend(loc='upper left', title='Left')
ax[1].invert_yaxis()
ax[1].set_title('Average monthly hours by No of project', fontsize=14)
plt.show()

Based on the plots,

Histogram

  • Typical average monthly working hours are in the range of 160 - 200 hrs.
  • It seems that all employees who worked on 7 projects left. Employees with 6 projects also worked more hours, but the ratio of those who stayed to those who left is very similar. The mean hours of these groups are between 250 and 300 hrs, indicating that they are overworked.
  • The optimal number of projects appears to be 3 or 4; at these levels, the employees who left are considerably fewer than those who stayed.

Box Plots
Employees who left the company fall into two groups:

  • Those who worked longer hours on more projects, and likely left because they were overworked.
  • Those who worked the fewest hours on fewer projects; they may have been fired, or may have given notice and therefore been assigned fewer projects and hours.

# Plots for satisfaction vs salary; satisfaction vs avg_mon_hrs
fig, ax = plt.subplots(1, 2, figsize = (20,8))

# plot for satisfaction vs salary
sns.boxplot(data=df1, x='satisfaction', y='salary', hue='left', 
            orient="h", saturation=0.75, ax=ax[0])
ax[0].invert_yaxis()
ax[0].legend(loc='upper left', title='Left')
ax[0].set_title('Satisfaction vs Salary', fontsize=14)

# Plot for satisfaction vs avg_mon_hrs
sns.scatterplot(data=df1, x='satisfaction', y='avg_mon_hrs', hue='left', alpha=0.4, ax=ax[1])
ax[1].set_title('Satisfaction level by average monthly work hours', fontsize=14)

Based on the plots,

Box plot

Salary appears to be strongly related to satisfaction level. At the low and medium salary levels, employees who left show very low satisfaction scores, and a high number of employees left the company.

Scatter plot

Employees who worked long hours show very low satisfaction levels. A second group with satisfaction levels below 0.5 worked fewer hours, which might be because they were fired or had given notice to leave the company. This is consistent with the previous box plots.


# Plot for avg_mon_hrs vs last_eval; avg_mon_hrs vs promotion_<5yrs
fig, ax = plt.subplots(1, 2, figsize = (20,8))

# Plot for avg_mon_hrs vs promotion_<5yrs
sns.scatterplot(data=df1, x='avg_mon_hrs', y='promotion_<5yrs', hue='left', ax=ax[0])
ax[0].set_title('Average monthly hours by promotion in the last 5 years', fontsize=14)

# Plot for avg_mon_hrs vs last_eval
sns.scatterplot(data=df1, x='avg_mon_hrs', y='last_eval', hue='left', alpha=0.4, ax=ax[1])
ax[1].set_title('Average monthly hours by evaluation score', fontsize=14)

Based on the plot,

Avg_mon_hrs vs Promotion_<5yrs

  • The employees who left had worked long hours without being promoted.
  • Only a few of the employees who worked long hours were promoted.

avg_mon_hrs vs last_eval

Employees who left fall into these groups:

  • Overworked employees who performed well
  • Employees who worked fewer hours and received low evaluation scores

Overall, most of the employees work more than the average monthly work hours range.
# Plot for distribution of employee who left by department
plt.figure(figsize=(13,10))
sns.histplot(data=df1, x='department', hue='left', discrete=1, 
             hue_order=[0, 1], multiple='dodge', shrink=.5)
plt.title('Employees distribution classified by department', fontsize=14)

plt.show()

Sales, Technical, and Support are the three departments with the highest number of employees who left.
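
Because a histogram of counts also reflects how many people work in each department, it may help to look at leave rates alongside the raw counts. The sketch below uses a tiny invented frame (the department sizes and labels are assumptions, not results from the HR data) to show how both could be computed.

```python
import pandas as pd

# Hypothetical toy data: 'sales' is a large department, 'hr' a small one
toy = pd.DataFrame({
    'department': ['sales'] * 8 + ['hr'] * 2,
    'left': [1, 1, 0, 0, 0, 0, 0, 0, 1, 0],
})

# leavers = number who left, headcount = department size, rate = share who left
by_dept = toy.groupby('department')['left'].agg(
    leavers='sum', headcount='count', rate='mean'
)
print(by_dept.sort_values('rate', ascending=False))
```

In this toy example the larger department has more leavers but the lower leave rate, which is the kind of difference a count-only plot cannot show.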

3.2.4 EDA Outcomes and Insights

The key drivers behind employees leaving are:

  • Long working hours
  • High number of projects
  • Not getting a promotion for their effort
  • Low evaluation scores

Most of the employees who left appear to be burned out from working long hours on a large number of projects without receiving benefits such as a promotion or a higher salary. Dissatisfaction is prevalent among overworked staff, especially when rewards such as promotions or salary increments are absent. These findings highlight potential issues in company management and HR policies that may need to be investigated further, along with strategic action to improve employee satisfaction and retention.

4 Machine Learning Models

To assess the likelihood of an employee leaving, logistic regression and tree-based models (decision tree and random forest) were employed. The categorical variables are first encoded: salary was mapped ordinally from low to high, and department was encoded using dummy variables to retain the information of all the categories. Outliers in tenure were also removed for the logistic regression model to ensure better model stability and performance.

Logistic regression was selected for its interpretability and ability to model linear relationships, while decision tree models were implemented to capture potential non-linear patterns and interactions between variables. These two approaches offer complementary perspectives on the dataset, allowing for a more robust evaluation of predictive power and feature importance.

# Encoding the categorical into numerical 
# Copy the dataframe for the modelling
enc_df = df1.copy()

# Mapping the salary category with ordinal numbers according to hierarchy
salary_map = {'low':0, 'medium':1, 'high':2}

# Replace the salary categories with their ordinal codes
enc_df['salary'] = enc_df['salary'].map(salary_map)

# Encoding the `department` with dummy variables
enc_df = pd.get_dummies(enc_df, drop_first=False)
enc_df.head().style.set_table_attributes("class='table table-sm'")
  satisfaction last_eval #_projects avg_mon_hrs tenure work_accident left promotion_<5yrs salary department_IT department_RandD department_accounting department_hr department_management department_marketing department_product_mng department_sales department_support department_technical
0 0.380000 0.530000 2 157 3 0 1 0 0 False False False False False False False True False False
1 0.800000 0.860000 5 262 6 0 1 0 1 False False False False False False False True False False
2 0.110000 0.880000 7 272 4 0 1 0 1 False False False False False False False True False False
3 0.720000 0.870000 5 223 5 0 1 0 0 False False False False False False False True False False
4 0.370000 0.520000 2 159 3 0 1 0 0 False False False False False False False True False False
# Removing the outliers in the tenure and saving it in a new dataframe
df_lr = enc_df[(enc_df['tenure'] >= lower_limit) & (enc_df['tenure'] <= upper_limit)]

df_lr.head().reset_index(drop=True).style.set_table_attributes("class='table table-sm'")
  satisfaction last_eval #_projects avg_mon_hrs tenure work_accident left promotion_<5yrs salary department_IT department_RandD department_accounting department_hr department_management department_marketing department_product_mng department_sales department_support department_technical
0 0.380000 0.530000 2 157 3 0 1 0 0 False False False False False False False True False False
1 0.110000 0.880000 7 272 4 0 1 0 1 False False False False False False False True False False
2 0.720000 0.870000 5 223 5 0 1 0 0 False False False False False False False True False False
3 0.370000 0.520000 2 159 3 0 1 0 0 False False False False False False False True False False
4 0.410000 0.500000 2 153 3 0 1 0 0 False False False False False False False True False False
list(df_lr.shape)
## [11167, 19]

4.1 Logistic Regression

Logistic regression is a supervised machine learning algorithm used for classification problems, especially when the target variable is binary. Unlike linear regression, which is used to predict continuous outcomes, logistic regression estimates the probability that a given input belongs to a particular category. It uses the logistic (sigmoid) function to map predicted values to a range between 0 and 1, making it ideal for predicting boolean outcomes.
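
As a minimal sketch of the sigmoid mapping, the snippet below converts a few invented linear scores into probabilities and class labels. The scores and the 0.5 threshold are illustrative assumptions; the function itself is the standard logistic function.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical linear scores (intercept + weighted features) for three employees
for score in (-3.0, 0.0, 3.0):
    p_leave = sigmoid(score)
    label = int(p_leave > 0.5)   # predict "leaves" only above a 0.5 threshold
    print(f"score={score:+.1f}  P(leave)={p_leave:.3f}  predicted={label}")
```

A score of zero corresponds to a probability of exactly 0.5, and scores of equal size but opposite sign give probabilities that sum to 1.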

The logistic regression starts by setting the target and predictor variables. The model is then trained on the training dataset and evaluated on the test dataset.


# Setting the 'y' variable
y = df_lr['left']

# Setting the 'x' variable with dropping the left column
X = df_lr.drop('left', axis=1)
# Split the data into training (75%) and test (25%) dataset 
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, stratify=y, random_state=0)

# Constructing the LogReg model
log_clf = LogisticRegression(random_state=0, max_iter=500)

# Fitting the model
log_clf.fit(X_train,y_train)
## LogisticRegression(max_iter=500, random_state=0)
# Use the model for the test dataset
y_pred = log_clf.predict(X_test)

# Constructing a confusion matrix 
# Computing values in the matrix
log_cm = confusion_matrix(y_test, y_pred, labels=log_clf.classes_)

# Create display of confusion matrix
log_disp = ConfusionMatrixDisplay(confusion_matrix = log_cm, 
                                  display_labels = log_clf.classes_)

# Plot confusion matrix
log_disp.plot(values_format='')
# Display plot
plt.show()

The confusion matrix shows:

  • True Positives - employees who left and were predicted to leave = 112
  • True Negatives - employees who stayed and were predicted to stay = 2193
  • False Positives - employees who stayed but were predicted to leave = 128
  • False Negatives - employees who left but were predicted to stay = 359

Checking the class imbalance

df_lr['left'].value_counts(normalize=True)
## left
## 0    0.831468
## 1    0.168532
## Name: proportion, dtype: float64

The data shows an 83% - 17% split: the class distribution is imbalanced, with only 17% of the rows representing employees who left.
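
With an imbalance like this, accuracy is easier to read next to a naive baseline. The sketch below takes the four test-set counts from the confusion matrix above and compares the model's accuracy with a rule that always predicts "stays"; the variable names are illustrative.

```python
# Test-set counts taken from the confusion matrix reported above
tn, fp, fn, tp = 2193, 128, 359, 112
n_test = tn + fp + fn + tp

# Accuracy of the fitted logistic regression
model_accuracy = (tp + tn) / n_test

# Accuracy of a naive rule that always predicts "stays" (class 0)
baseline_accuracy = (tn + fp) / n_test

print(f"model:    {model_accuracy:.4f}")
print(f"baseline: {baseline_accuracy:.4f}")
```

On these counts the always-stays rule scores slightly higher on accuracy than the model, which is why accuracy alone says little about how well leavers are identified.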

# Create classification report for logistic regression model
row_names = ['Predicted would not leave', 'Predicted would leave']

# Generate report as dict
report_logr = classification_report(y_test, y_pred, target_names = row_names, output_dict=True)

# Choose averaging strategy 
# Extract metrics
avg_type = 'weighted avg'
precision = report_logr[avg_type]['precision']
recall = report_logr[avg_type]['recall']
f1 = report_logr[avg_type]['f1-score']
accuracy = accuracy_score(y_test, y_pred)

# Needed for AUC
y_proba = log_clf.predict_proba(X_test)[:, 1] 
auc = roc_auc_score(y_test, y_proba)


# Create the row
summary_row = {
    'model': 'Logistic Regression',
    'precision': round(precision, 6),
    'recall': round(recall, 6),
    'F1': round(f1, 6),
    'accuracy': round(accuracy, 6),
    'auc': round(auc, 6)
}

# Convert to DataFrame
report_logr = pd.DataFrame([summary_row])
report_logr.style.set_table_attributes("class='table table-sm'")
  model precision recall F1 accuracy auc
0 Logistic Regression 0.793086 0.825573 0.801372 0.825573 0.892291

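The classification report shows the following weighted averages across both classes:

  • Precision = 79%
  • Recall = 83%
  • F1 score = 80%
  • Accuracy = 83%

However, these averages are dominated by the majority class. For the objective of this project, which is to predict the employees who will leave, the model performs poorly: based on the confusion matrix above, it identifies only 112 of the 471 employees who actually left (a recall of about 24%).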
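
To see where the weighted averages come from, the sketch below recomputes per-class precision, recall, and F1 from the confusion matrix counts reported earlier and then weights them by class support. The helper names `prf` and `weighted` are hypothetical and exist only for this illustration.

```python
# Test-set counts from the logistic regression confusion matrix above
tn, fp, fn, tp = 2193, 128, 359, 112

def prf(true_pos, false_pos, false_neg):
    """Precision, recall and F1 for one class from its counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Class 1 = left; for class 0 = stayed, the roles of the counts swap
left_p, left_r, left_f1 = prf(tp, fp, fn)
stay_p, stay_r, stay_f1 = prf(tn, fn, fp)

# Support = number of true members of each class in the test set
n_left, n_stay = tp + fn, tn + fp
total = n_left + n_stay

def weighted(stay_value, left_value):
    """Support-weighted average of a per-class metric."""
    return (n_stay * stay_value + n_left * left_value) / total

print(f"left class: precision={left_p:.3f} recall={left_r:.3f} f1={left_f1:.3f}")
print(f"weighted:   precision={weighted(stay_p, left_p):.6f} "
      f"recall={weighted(stay_r, left_r):.6f} f1={weighted(stay_f1, left_f1):.6f}")
```

The weighted values match the summary table, while the figures for the "left" class alone are much lower, since roughly five out of six test rows belong to the "stayed" class.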

4.2 Tree-based Model - Decision Tree & Random Forest

Tree-based models, such as Decision Trees and Random Forests, are powerful and intuitive methods used for both classification and regression tasks.

A Decision Tree splits the data into subsets based on the value of input features, creating a tree-like structure where each internal node represents a decision rule. It is easy to interpret but can be prone to overfitting.

To overcome this, Random Forest, an ensemble technique, builds multiple Decision Trees on different random subsets of the data and aggregates their predictions to improve accuracy and generalization. Random Forests reduce variance and handle high-dimensional data well, making them robust for complex modeling tasks.
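
To illustrate how a single decision rule inside a tree is scored, the sketch below computes the Gini impurity (the default split criterion of scikit-learn's `DecisionTreeClassifier`) before and after a hypothetical split. The threshold in the comment and the tiny label lists are invented for illustration only.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - (p1 ** 2 + (1.0 - p1) ** 2)

# Hypothetical node: 4 leavers (1) and 4 stayers (0)
parent = [1, 1, 1, 1, 0, 0, 0, 0]

# Hypothetical rule "satisfaction <= 0.45" sends these labels left and right
left_child = [1, 1, 1, 0]
right_child = [1, 0, 0, 0]

n = len(parent)
weighted_child_gini = (len(left_child) / n) * gini(left_child) \
                    + (len(right_child) / n) * gini(right_child)

print(gini(parent))           # impurity before the split
print(weighted_child_gini)    # impurity after the split (lower is better)
```

A tree keeps choosing the rule that lowers the weighted impurity the most, which is also why an unconstrained tree can keep splitting until it memorizes the training data.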

The tree-based models start by setting the target and predictor variables. The models are then trained on the training dataset and evaluated on the test dataset.


# Using the enc_df dataframe
# Setting the y variable
y = enc_df['left']

# Setting the X variable
X = enc_df.drop('left',axis=1)

# Split the data into training (75%) and test (25%) dataset 
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, stratify=y, random_state=0)

4.2.1 Decision Tree Model


# Instantiate the decision tree model
tree = DecisionTreeClassifier(random_state=0)

# Assign a dictionary of hyperparameters to search over
cv_params = {'max_depth':[2, 4, 6, None],
             'min_samples_leaf': [2, 6, 3],
             'min_samples_split': [2, 5,7]
             }

# Assign a dictionary of scoring metrics to capture
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']

# Instantiate GridSearch
dtree1 = GridSearchCV(tree, cv_params, scoring=scoring, cv=4, refit='roc_auc')
# Fitting the model
dtree1.fit(X_train,y_train)
GridSearchCV(cv=4, estimator=DecisionTreeClassifier(random_state=0),
             param_grid={'max_depth': [2, 4, 6, None],
                         'min_samples_leaf': [2, 6, 3],
                         'min_samples_split': [2, 5, 7]},
             refit='roc_auc',
             scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'])
# Check best parameters
print(dtree1.best_params_)
## {'max_depth': 4, 'min_samples_leaf': 6, 'min_samples_split': 2}
# Check best AUC score on CV
print(dtree1.best_score_)
## 0.9698667651120891
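
GridSearchCV fits every combination in cv_params once per fold, so the cost of the search grows with the product of the list lengths. The following count for the Decision Tree grid above uses itertools.product; the refit on the full training set adds one more fit on top of this.

```python
from itertools import product

# Same grid as the Decision Tree search above
cv_params = {
    'max_depth': [2, 4, 6, None],
    'min_samples_leaf': [2, 6, 3],
    'min_samples_split': [2, 5, 7],
}
n_folds = 4

# Every combination of one value per hyperparameter
candidates = [dict(zip(cv_params, combo)) for combo in product(*cv_params.values())]
n_fits = len(candidates) * n_folds

print(f"{len(candidates)} candidates x {n_folds} folds = {n_fits} fits")
```

The same count applied to a larger grid gives a quick estimate of how much longer that search will take.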

def make_results(model_name:str, model_object, metric:str):
    '''
    Arguments:
        model_name (string): what you want the model to be called in the output table
        model_object: a fit GridSearchCV object
        metric (string): precision, recall, f1, accuracy, or auc
  
    Returns a pandas df with the F1, recall, precision, accuracy, and auc scores
    for the model with the best mean 'metric' score across all validation folds.  
    '''

    # Create dictionary that maps input metric to actual metric name in GridSearchCV
    metric_dict = {'auc': 'mean_test_roc_auc',
                   'precision': 'mean_test_precision',
                   'recall': 'mean_test_recall',
                   'f1': 'mean_test_f1',
                   'accuracy': 'mean_test_accuracy'
                  }

    # Get all the results from the CV and put them in a df
    cv_results = pd.DataFrame(model_object.cv_results_)

    # Isolate the row of the df with the max(metric) score
    best_estimator_results = cv_results.iloc[cv_results[metric_dict[metric]].idxmax(), :]

    # Extract Accuracy, precision, recall, and f1 score from that row
    auc = best_estimator_results.mean_test_roc_auc
    f1 = best_estimator_results.mean_test_f1
    recall = best_estimator_results.mean_test_recall
    precision = best_estimator_results.mean_test_precision
    accuracy = best_estimator_results.mean_test_accuracy
  
    # Create table of results
    table = pd.DataFrame({'model': [model_name],
                          'precision': [precision],
                          'recall': [recall],
                          'F1': [f1],
                          'accuracy': [accuracy],
                          'auc': [auc]
                        })
  
    return table
  
# Get all CV scores
dtree1_cv_results = make_results('Decision Tree 1 CV', dtree1, 'auc')
dtree1_cv_results.reset_index(drop=True, inplace=True)
dtree1_cv_results.style.set_table_attributes("class='table table-sm'")
  model precision recall F1 accuracy auc
0 Decision Tree 1 CV 0.914490 0.916279 0.915345 0.971867 0.969867

The Decision Tree model demonstrated strong performance, achieving high scores across key metrics such as,

  • Precision = 91.4%
  • Recall = 91.6%
  • F1-score = 91.5%
  • Accuracy = 97.2%
  • AUC = 0.97

These results indicate that the model fits the data well. However, a decision tree is prone to overfitting. To address this concern and check for better generalization, a Random Forest model was trained for comparison.

4.2.2 Random Forest


# Instantiate model
rf = RandomForestClassifier(random_state=0)

# Assign a dictionary of hyperparameters to search over
cv_params = {'max_depth': [3,5, None], 
             'max_features': [1.0],
             'max_samples': [0.7, 1.0],
             'min_samples_leaf': [1,2,3],
             'min_samples_split': [2,3,4],
             'n_estimators': [300, 500],
             }      

# Assign a dictionary of scoring metrics to capture
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']

# Instantiate GridSearch
rf1 = GridSearchCV(rf, cv_params, scoring=scoring, cv=4, refit='roc_auc', n_jobs=-1)
# Fitting the model
rf1.fit(X_train, y_train) 
GridSearchCV(cv=4, estimator=RandomForestClassifier(random_state=0), n_jobs=-1,
             param_grid={'max_depth': [3, 5, None], 'max_features': [1.0],
                         'max_samples': [0.7, 1.0],
                         'min_samples_leaf': [1, 2, 3],
                         'min_samples_split': [2, 3, 4],
                         'n_estimators': [300, 500]},
             refit='roc_auc',
             scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'])
# Check best params
rf1.best_params_
## {'max_depth': 5, 'max_features': 1.0, 'max_samples': 0.7, 'min_samples_leaf': 1, 'min_samples_split': 4, 'n_estimators': 500}
  
# Check best AUC score on CV
rf1.best_score_
## np.float64(0.9804250949807172)
# Get all CV scores
rf1_cv_results = make_results('Random Forest 1 CV', rf1, 'auc')
results = pd.concat([rf1_cv_results,dtree1_cv_results], axis=0)
results.reset_index(drop=True, inplace=True)
results.style.set_table_attributes("class='table table-sm'")
  model precision recall F1 accuracy auc
0 Random Forest 1 CV 0.950023 0.915614 0.932467 0.977983 0.980425
1 Decision Tree 1 CV 0.914490 0.916279 0.915345 0.971867 0.969867

The Random Forest model also demonstrated strong performance in cross-validation on the training set, achieving high scores across key metrics such as,

  • Precision = 95.0%
  • Recall = 91.6%
  • F1-score = 93.2%
  • Accuracy = 97.8%
  • AUC = 0.98

These metrics indicate the model’s strong ability to correctly classify both classes while maintaining high generalization performance.

Based on the model training results,

  • The Random Forest model scores better than the Decision Tree on precision, F1-score, accuracy, and AUC; recall is essentially the same (0.9156 vs 0.9163).
  • Random Forest helps reduce overfitting, a known issue with standalone Decision Trees, and so improves predictive reliability.
  • Because the Random Forest model performs better than the Decision Tree, it is the model evaluated on the test set.

def get_scores(model_name:str, model, X_test_data, y_test_data):
    '''
    Generate a table of test scores.

    In: 
        model_name (string):  How you want your model to be named in the output table
        model:                A fit GridSearchCV object
        X_test_data:          numpy array of X_test data
        y_test_data:          numpy array of y_test data

    Out: pandas df of precision, recall, f1, accuracy, and AUC scores for your model
    '''

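    preds = model.best_estimator_.predict(X_test_data)

    # Note: AUC is computed here from hard class predictions, not predicted
    # probabilities, so it is not directly comparable with the roc_auc
    # scores reported by GridSearchCV.
    auc = roc_auc_score(y_test_data, preds)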
    accuracy = accuracy_score(y_test_data, preds)
    precision = precision_score(y_test_data, preds)
    recall = recall_score(y_test_data, preds)
    f1 = f1_score(y_test_data, preds)

    table = pd.DataFrame({'model': [model_name],
                          'precision': [precision], 
                          'recall': [recall],
                          'f1': [f1],
                          'accuracy': [accuracy],
                          'AUC': [auc]
                         })
  
    return table
  
# Get predictions on test data
rf1_test_scores = get_scores('Random Forest 1 Test', rf1, X_test, y_test)
rf1_test_scores.style.set_table_attributes("class='table table-sm'")
  model precision recall f1 accuracy AUC
0 Random Forest 1 Test 0.964211 0.919679 0.941418 0.980987 0.956439

The Random Forest model also demonstrated strong performance on the test set, achieving high scores across key metrics such as,

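  • Precision = 96.4%
  • Recall = 92.0%
  • F1-score = 94.1%
  • Accuracy = 98.1%
  • AUC = 0.96

Precision, recall, F1-score, and accuracy on the test set are similar to, and slightly higher than, the cross-validation results, which suggests the model generalizes well. The precision, recall, and F1-score show a good balance between false positives and false negatives, and the accuracy and AUC indicate strong discriminatory power between the two classes. The test AUC (0.96) is lower than the cross-validation AUC (0.98), at least in part because get_scores computes AUC from hard class predictions rather than predicted probabilities. Because the test data was used only for this final evaluation, performance on new, unseen data is expected to be similar.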
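
The get_scores helper above passes hard class predictions to roc_auc_score, while the roc_auc scorer in GridSearchCV ranks rows by predicted score. The two are not the same quantity: with hard 0/1 predictions, AUC reduces to the average of recall and specificity. The sketch below shows the difference on invented scores, using a small pairwise AUC function (a hypothetical helper, not the scikit-learn implementation).

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Invented labels and predicted probabilities
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
proba = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
hard = [int(p >= 0.5) for p in proba]  # thresholded class predictions

auc_from_proba = pairwise_auc(y_true, proba)
auc_from_hard = pairwise_auc(y_true, hard)

# Recall (true positive rate) and specificity (true negative rate) of the hard predictions
tpr = sum(h for t, h in zip(y_true, hard) if t == 1) / y_true.count(1)
tnr = sum(1 - h for t, h in zip(y_true, hard) if t == 0) / y_true.count(0)

print(f"AUC from probabilities   = {auc_from_proba:.4f}")
print(f"AUC from hard labels     = {auc_from_hard:.4f}")
print(f"(recall + specificity)/2 = {(tpr + tnr) / 2:.4f}")
```

To make a test AUC comparable with the cross-validation AUC, one option is to pass predict_proba(X_test_data)[:, 1] to roc_auc_score, as the logistic regression code does.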

The Round 1 models used all the variables as features. For the Round 2 models, feature engineering is used to adapt the data with the aim of improving the model.

5 Machine Learning Model - After Feature Engineering

Relevant variables were selected and transformed to improve model performance, including encoding categorical features, handling missing values, and creating meaningful derived features to enhance predictive accuracy. In this dataset, the following can be engineered:

  • Satisfaction level may not be available for every employee, so dropping it is an option.
  • Average monthly hours may be a source of data leakage, as it might be recorded after an employee has given notice to resign or the company has given notice to leave. Engineering it into a new binary variable, overworked, might help improve the model's predictions.

# Drop `satisfaction` and save resulting dataframe in new variable
df2 = enc_df.drop('satisfaction', axis=1)

# Display first few rows of new dataframe
df2.head().style.set_table_attributes("class='table table-sm'")
  last_eval #_projects avg_mon_hrs tenure work_accident left promotion_<5yrs salary department_IT department_RandD department_accounting department_hr department_management department_marketing department_product_mng department_sales department_support department_technical
0 0.530000 2 157 3 0 1 0 0 False False False False False False False True False False
1 0.860000 5 262 6 0 1 0 1 False False False False False False False True False False
2 0.880000 7 272 4 0 1 0 1 False False False False False False False True False False
3 0.870000 5 223 5 0 1 0 0 False False False False False False False True False False
4 0.520000 2 159 3 0 1 0 0 False False False False False False False True False False
# Create `overworked` column. For now, it's identical to average monthly hours.
df2['overworked'] = df2['avg_mon_hrs']

# Inspect max and min average monthly hours values
print('Max hours:', df2['overworked'].max())
## Max hours: 310
print('Min hours:', df2['overworked'].min())
## Min hours: 96
# Define `overworked` as working > 175 hrs/month
df2['overworked'] = (df2['overworked'] > 175).astype(int)

# Display first few rows of new column
df2[['overworked']].head().style.set_table_attributes("class='table table-sm'")
  overworked
0 0
1 1
2 1
3 1
4 0

Assuming a 40-hour work week and two weeks of vacation per year, average working hours per month = 40 hours * 50 weeks / 12 months = 166.67 hours. Overworked is therefore defined as working more than 175 hours per month on average: employees above that threshold were classified as overworked (1), and all others as not overworked (0).

To enrich the dataset with meaningful predictors, a new binary feature overworked was engineered based on average monthly working hours. This engineered feature adds interpretability to the model and allows it to capture the potential impact of excessive working hours on employee behavior or outcomes.
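
The arithmetic behind the threshold can be checked in a few lines. The constant names are illustrative, and the five hour values are the avg_mon_hrs entries from the first rows displayed above.

```python
HOURS_PER_WEEK = 40
WORKING_WEEKS_PER_YEAR = 50      # 52 weeks minus two weeks of vacation
OVERWORKED_THRESHOLD = 175       # hours per month, as chosen in this report

# Expected monthly hours for a standard schedule
baseline_monthly_hours = HOURS_PER_WEEK * WORKING_WEEKS_PER_YEAR / 12
print(f"baseline = {baseline_monthly_hours:.2f} hours/month")

# avg_mon_hrs for the first five rows of the dataframe
avg_mon_hrs = [157, 262, 272, 223, 159]
overworked = [int(hours > OVERWORKED_THRESHOLD) for hours in avg_mon_hrs]
print(overworked)
```

The threshold sits a little above the 166.67-hour baseline, so only employees clearly exceeding a standard schedule are flagged.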

# Drop the `avg_mon_hrs` column
df2 = df2.drop('avg_mon_hrs', axis=1)

# Display first few rows of resulting dataframe
df2.head().style.set_table_attributes("class='table table-sm'")
  last_eval #_projects tenure work_accident left promotion_<5yrs salary department_IT department_RandD department_accounting department_hr department_management department_marketing department_product_mng department_sales department_support department_technical overworked
0 0.530000 2 3 0 1 0 0 False False False False False False False True False False 0
1 0.860000 5 6 0 1 0 1 False False False False False False False True False False 1
2 0.880000 7 4 0 1 0 1 False False False False False False False True False False 1
3 0.870000 5 5 0 1 0 0 False False False False False False False True False False 1
4 0.520000 2 3 0 1 0 0 False False False False False False False True False False 0

# Isolate the outcome variable
y = df2['left']

# Select the features
X = df2.drop('left', axis=1)

# Create test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

5.0.1 Decision Tree 2


# Instantiate model
tree = DecisionTreeClassifier(random_state=0)

# Assign a dictionary of hyperparameters to search over
cv_params = {'max_depth':[4, 6, 8, None],
             'min_samples_leaf': [2, 5, 1],
             'min_samples_split': [2, 4, 6]
             }

# Assign a dictionary of scoring metrics to capture
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']

# Instantiate GridSearch
dtree2 = GridSearchCV(tree, cv_params, scoring=scoring, cv=4, refit='roc_auc')
 
dtree2.fit(X_train, y_train)
GridSearchCV(cv=4, estimator=DecisionTreeClassifier(random_state=0),
             param_grid={'max_depth': [4, 6, 8, None],
                         'min_samples_leaf': [2, 5, 1],
                         'min_samples_split': [2, 4, 6]},
             refit='roc_auc',
             scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'])
# Check best params
dtree2.best_params_
## {'max_depth': 6, 'min_samples_leaf': 2, 'min_samples_split': 6}
# Check best AUC score on CV
dtree2.best_score_
## np.float64(0.9586752505340426)
# Get all CV scores
dtree2_cv_results = make_results('Decision Tree 2 CV', dtree2, 'auc')
results = pd.concat([dtree1_cv_results,dtree2_cv_results,rf1_cv_results], axis=0)
results.reset_index(drop=True, inplace=True)
results.style.set_table_attributes("class='table table-sm'")
  model precision recall F1 accuracy auc
0 Decision Tree 1 CV 0.914490 0.916279 0.915345 0.971867 0.969867
1 Decision Tree 2 CV 0.856693 0.903553 0.878882 0.958523 0.958675
2 Random Forest 1 CV 0.950023 0.915614 0.932467 0.977983 0.980425

The Decision Tree 2 model shows balanced performance across all metrics, but slightly underperforms both Decision Tree 1 CV and Random Forest 1 CV, with scores of,

  • Precision = 85.7%
  • Recall = 90.4%
  • F1-score = 87.9%
  • Accuracy = 95.9%
  • AUC = 0.96

5.0.2 Random Forest 2


# Instantiate model
rf = RandomForestClassifier(random_state=0)

# Assign a dictionary of hyperparameters to search over
cv_params = {'max_depth': [3,5, None], 
             'max_features': [1.0],
             'max_samples': [0.7, 1.0],
             'min_samples_leaf': [1,2,3],
             'min_samples_split': [2,3,4],
             'n_estimators': [300, 500],
             }   

# Assign a dictionary of scoring metrics to capture
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']

# Instantiate GridSearch
rf2 = GridSearchCV(rf, cv_params, scoring=scoring, cv=4, refit='roc_auc', n_jobs=-1)
# Fitting the Model
rf2.fit(X_train, y_train)
GridSearchCV(cv=4, estimator=RandomForestClassifier(random_state=0), n_jobs=-1,
             param_grid={'max_depth': [3, 5, None], 'max_features': [1.0],
                         'max_samples': [0.7, 1.0],
                         'min_samples_leaf': [1, 2, 3],
                         'min_samples_split': [2, 3, 4],
                         'n_estimators': [300, 500]},
             refit='roc_auc',
             scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'])
# Check best params
rf2.best_params_
## {'max_depth': 5, 'max_features': 1.0, 'max_samples': 0.7, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 300}
# Check best AUC score on CV
rf2.best_score_
## np.float64(0.9648100662833985)
# Get all CV scores
rf2_cv_results = make_results('Random Forest 2 CV', rf2, 'auc')
results = pd.concat([dtree1_cv_results,dtree2_cv_results,rf1_cv_results,rf2_cv_results], axis=0)
results.reset_index(drop=True, inplace=True)
results.style.set_table_attributes("class='table table-sm'")
  model precision recall F1 accuracy auc
0 Decision Tree 1 CV 0.914490 0.916279 0.915345 0.971867 0.969867
1 Decision Tree 2 CV 0.856693 0.903553 0.878882 0.958523 0.958675
2 Random Forest 1 CV 0.950023 0.915614 0.932467 0.977983 0.980425
3 Random Forest 2 CV 0.866758 0.878754 0.872407 0.957411 0.964810

The Random Forest 2 CV model shows balanced performance across all metrics, but underperforms Random Forest 1 CV. Compared with Decision Tree 2 CV, it scores slightly higher on precision and AUC and slightly lower on recall, F1-score, and accuracy, with scores being,

  • Precision = 86.7%
  • Recall = 87.9%
  • F1-score = 87.2%
  • Accuracy = 95.7%
  • AUC = 0.96

Based on the cross-validation results for the two rounds of Decision Tree and Random Forest models,

  • Random Forest 1 CV is the winning model, with the highest precision, F1-score, accuracy, and AUC of the four, so it is the model to use for prediction.
  • In general, the Random Forest models perform well with the ROC-AUC score as the deciding metric.

Plotting a confusion matrix to visualize the Random Forest 2 model's predictions on the test set


# Generate array of values for confusion matrix
preds = rf2.best_estimator_.predict(X_test)
cm = confusion_matrix(y_test, preds, labels=rf2.classes_)

# Plot confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                             display_labels=rf2.classes_)
disp.plot(values_format='');

# Get predictions on test data
rf2_test_scores = get_scores('Random Forest 2 Test', rf2, X_test, y_test)
test_results = pd.concat([rf1_test_scores, rf2_test_scores], axis=0)
test_results.reset_index(drop=True, inplace=True)
test_results.style.set_table_attributes("class='table table-sm'")
  model precision recall f1 accuracy AUC
0 Random Forest 1 Test 0.964211 0.919679 0.941418 0.980987 0.956439
1 Random Forest 2 Test 0.870406 0.903614 0.886700 0.961641 0.938407

A perfect model would yield all true negatives and true positives, and no false negatives or false positives. The Random Forest models demonstrated strong performance on the test set. These results highlight the robustness and consistency of Random Forest classifiers in predicting employee outcomes.

Comparing the two models, Model 1 (baseline Random Forest) outperforms Model 2 (feature-engineered Random Forest) on every metric: precision, recall, F1-score, accuracy, and AUC. Recall drops only slightly (0.92 vs 0.90), but the larger drop in precision suggests that dropping satisfaction and replacing average monthly hours with a binary overworked flag removed useful signal.

In this case, the feature engineering did not improve model performance. Instead, it slightly degraded the classifier's ability to distinguish between classes. This highlights the importance of validating each feature transformation step, as not all feature engineering enhances model learning; it can also obscure useful signals or increase dimensionality unnecessarily.
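
A metric-by-metric difference makes the comparison explicit. The values below are copied from the test results table above; the dictionary names are illustrative.

```python
# Test-set scores copied from the table above
rf1_test = {'precision': 0.964211, 'recall': 0.919679, 'f1': 0.941418,
            'accuracy': 0.980987, 'AUC': 0.956439}
rf2_test = {'precision': 0.870406, 'recall': 0.903614, 'f1': 0.886700,
            'accuracy': 0.961641, 'AUC': 0.938407}

# Positive difference means Model 1 (Random Forest 1) scores higher
diff = {metric: round(rf1_test[metric] - rf2_test[metric], 6) for metric in rf1_test}

for metric, delta in diff.items():
    print(f"{metric:<10} {delta:+.6f}")
```

Every difference is positive, with precision showing the largest gap and recall the smallest.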

6 Results and Evaluation

6.1 Logistic Regression

report_logr.style.set_table_attributes("class='table table-sm'")
  model precision recall F1 accuracy auc
0 Logistic Regression 0.793086 0.825573 0.801372 0.825573 0.892291

The model scores poorly against the project's objective, which is to identify the employees who will leave. The class distribution is imbalanced, with an 83%/17% split in which only 17% of the data represents employees who left. This imbalance may contribute to the model's bias toward predicting that employees will stay.
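
One way to put the imbalance in perspective is a majority-class baseline: a model that always predicts "stayed". The sketch below assumes exactly 83 stayers and 17 leavers per 100 employees, which is the approximate split quoted above rather than the exact counts.

```python
# Approximate class split per 100 employees: 0 = stayed, 1 = left
y_true = [0] * 83 + [1] * 17

# Baseline that always predicts the majority class
y_pred = [0] * len(y_true)

baseline_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
leavers_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
baseline_recall_left = leavers_caught / y_true.count(1)

print(f"baseline accuracy      = {baseline_accuracy:.2f}")
print(f"baseline recall (left) = {baseline_recall_left:.2f}")
```

The logistic regression accuracy of 0.826 is about the same as this baseline, which is why accuracy alone says little here and recall for leavers matters more.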

Based on the insights and limitations, it is worth exploring alternative classification models such as Decision Tree and Random Forest, which may handle non-linear relationships and imbalanced data more effectively, and potentially improve prediction of employee attrition.

6.2 Decision Tree

# Get all CV scores
dt_results = pd.concat([dtree1_cv_results,dtree2_cv_results], axis=0)
dt_results.reset_index(drop=True, inplace=True)
dt_results.style.set_table_attributes("class='table table-sm'")
  model precision recall F1 accuracy auc
0 Decision Tree 1 CV 0.914490 0.916279 0.915345 0.971867 0.969867
1 Decision Tree 2 CV 0.856693 0.903553 0.878882 0.958523 0.958675

Decision Tree 1 (CV) outperformed Decision Tree 2 (CV) across all evaluation metrics. It achieved a higher precision, recall, and F1 score, indicating better balance between false positives and false negatives. Additionally, it showed superior accuracy and AUC, suggesting a more robust overall classification performance.

6.2.1 Plotting the Decision Tree

# Plot the tree
plt.figure(figsize=(85,50))
plot_tree(dtree2.best_estimator_, max_depth=6, fontsize=14, feature_names=list(X.columns), 
          class_names=['stayed', 'left'], filled=True);
plt.show()

# Extract rules as plain text
tree_rules = export_text(dtree2.best_estimator_, feature_names=list(X.columns), max_depth=6)
print(tree_rules)
## |--- #_projects <= 2.50
## |   |--- last_eval <= 0.57
## |   |   |--- overworked <= 0.50
## |   |   |   |--- last_eval <= 0.44
## |   |   |   |   |--- class: 0
## |   |   |   |--- last_eval >  0.44
## |   |   |   |   |--- tenure <= 2.50
## |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- tenure >  2.50
## |   |   |   |   |   |--- tenure <= 3.50
## |   |   |   |   |   |   |--- class: 1
## |   |   |   |   |   |--- tenure >  3.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |--- overworked >  0.50
## |   |   |   |--- department_sales <= 0.50
## |   |   |   |   |--- department_IT <= 0.50
## |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- department_IT >  0.50
## |   |   |   |   |   |--- tenure <= 3.00
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- tenure >  3.00
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |--- department_sales >  0.50
## |   |   |   |   |--- last_eval <= 0.47
## |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- last_eval >  0.47
## |   |   |   |   |   |--- last_eval <= 0.48
## |   |   |   |   |   |   |--- class: 1
## |   |   |   |   |   |--- last_eval >  0.48
## |   |   |   |   |   |   |--- class: 0
## |   |--- last_eval >  0.57
## |   |   |--- last_eval <= 1.00
## |   |   |   |--- last_eval <= 0.75
## |   |   |   |   |--- department_technical <= 0.50
## |   |   |   |   |   |--- department_marketing <= 0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- department_marketing >  0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- department_technical >  0.50
## |   |   |   |   |   |--- last_eval <= 0.59
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- last_eval >  0.59
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |--- last_eval >  0.75
## |   |   |   |   |--- work_accident <= 0.50
## |   |   |   |   |   |--- salary <= 1.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- salary >  1.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- work_accident >  0.50
## |   |   |   |   |   |--- class: 0
## |   |   |--- last_eval >  1.00
## |   |   |   |--- salary <= 0.50
## |   |   |   |   |--- class: 1
## |   |   |   |--- salary >  0.50
## |   |   |   |   |--- class: 0
## |--- #_projects >  2.50
## |   |--- tenure <= 3.50
## |   |   |--- #_projects <= 5.50
## |   |   |   |--- work_accident <= 0.50
## |   |   |   |   |--- salary <= 1.50
## |   |   |   |   |   |--- last_eval <= 0.95
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- last_eval >  0.95
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- salary >  1.50
## |   |   |   |   |   |--- class: 0
## |   |   |   |--- work_accident >  0.50
## |   |   |   |   |--- department_sales <= 0.50
## |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- department_sales >  0.50
## |   |   |   |   |   |--- last_eval <= 0.84
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- last_eval >  0.84
## |   |   |   |   |   |   |--- class: 0
## |   |   |--- #_projects >  5.50
## |   |   |   |--- department_support <= 0.50
## |   |   |   |   |--- last_eval <= 0.89
## |   |   |   |   |   |--- department_technical <= 0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- department_technical >  0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- last_eval >  0.89
## |   |   |   |   |   |--- department_sales <= 0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- department_sales >  0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |--- department_support >  0.50
## |   |   |   |   |--- overworked <= 0.50
## |   |   |   |   |   |--- last_eval <= 0.73
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- last_eval >  0.73
## |   |   |   |   |   |   |--- class: 1
## |   |   |   |   |--- overworked >  0.50
## |   |   |   |   |   |--- last_eval <= 0.64
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- last_eval >  0.64
## |   |   |   |   |   |   |--- class: 0
## |   |--- tenure >  3.50
## |   |   |--- last_eval <= 0.76
## |   |   |   |--- #_projects <= 6.50
## |   |   |   |   |--- department_technical <= 0.50
## |   |   |   |   |   |--- #_projects <= 5.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- #_projects >  5.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- department_technical >  0.50
## |   |   |   |   |   |--- #_projects <= 3.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- #_projects >  3.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |--- #_projects >  6.50
## |   |   |   |   |--- class: 1
## |   |   |--- last_eval >  0.76
## |   |   |   |--- #_projects <= 4.50
## |   |   |   |   |--- tenure <= 4.50
## |   |   |   |   |   |--- last_eval <= 0.99
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- last_eval >  0.99
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- tenure >  4.50
## |   |   |   |   |   |--- #_projects <= 3.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- #_projects >  3.50
## |   |   |   |   |   |   |--- class: 1
## |   |   |   |--- #_projects >  4.50
## |   |   |   |   |--- overworked <= 0.50
## |   |   |   |   |   |--- department_support <= 0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |   |--- department_support >  0.50
## |   |   |   |   |   |   |--- class: 0
## |   |   |   |   |--- overworked >  0.50
## |   |   |   |   |   |--- #_projects <= 5.50
## |   |   |   |   |   |   |--- class: 1
## |   |   |   |   |   |--- #_projects >  5.50
## |   |   |   |   |   |   |--- class: 1

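Simplified, readable form of the main logic in the tree, listing some of the paths above that end in class 1 (left):

  • #_projects > 6.50, tenure > 3.50, and last_eval <= 0.76.
  • #_projects > 4.50, tenure > 3.50, last_eval > 0.76, and overworked.
  • #_projects <= 2.50, last_eval between 0.44 and 0.57, not overworked, and tenure between 2.50 and 3.50.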
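
Paths that end in a given class can also be pulled out of the export_text output programmatically. The sketch below parses text in that format; paths_to_class is a hypothetical helper, and the sample is an abridged excerpt of the rules printed above (sibling branches omitted), not the full tree.

```python
def paths_to_class(rules_text, target):
    """Return the list of conditions leading to each leaf labelled `target`."""
    paths, stack = [], []
    for line in rules_text.splitlines():
        if "|---" not in line:
            continue
        depth = line.index("|---") // 4          # each level is indented by 4 characters
        content = " ".join(line.split("|---", 1)[1].split())
        del stack[depth:]                         # drop conditions from deeper or sibling branches
        if content.startswith("class:"):
            if content.split(":", 1)[1].strip() == str(target):
                paths.append(list(stack))
        else:
            stack.append(content)
    return paths


# Abridged excerpt of the printed rules
sample_rules = """\
|--- #_projects <= 2.50
|   |--- last_eval >  0.57
|   |   |--- last_eval >  1.00
|   |   |   |--- salary <= 0.50
|   |   |   |   |--- class: 1
|   |   |   |--- salary >  0.50
|   |   |   |   |--- class: 0
"""

left_paths = paths_to_class(sample_rules, target=1)
for path in left_paths:
    print(" AND ".join(path))
```

Applied to the full tree_rules string, the same function would list every path that predicts an employee leaving.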

6.2.2 Feature importance

# Feature importances
dtree2_importances = pd.DataFrame(dtree2.best_estimator_.feature_importances_, 
                                 columns=['gini_importance'], 
                                 index=X.columns
                                )
dtree2_importances = dtree2_importances.sort_values(by='gini_importance', ascending=False)

# Only extract the features with importances > 0
dtree2_importances = dtree2_importances[dtree2_importances['gini_importance'] != 0]
dtree2_importances.style.set_table_attributes("class='table table-sm'")
  gini_importance
last_eval 0.343958
#_projects 0.343385
tenure 0.215681
overworked 0.093498
department_support 0.001142
salary 0.000910
department_sales 0.000607
department_technical 0.000418
work_accident 0.000183
department_IT 0.000139
department_marketing 0.000078
sns.barplot(data=dtree2_importances, x="gini_importance", y=dtree2_importances.index, orient='h')
plt.title("Decision Tree: Feature Importances for Employee Leaving", fontsize=14)
plt.ylabel("Feature")
plt.xlabel("Importance")
plt.show()

The feature importance plot for the Decision Tree 2 model shows that last_eval, #_projects, tenure, and overworked, in decreasing order of importance, contribute most to predicting the outcome variable left. In contrast, the department, salary, and work_accident features contribute minimally. This suggests that performance evaluation, workload, and time spent at the company are key factors influencing employee attrition.
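
Gini importances sum to 1 across all features, so the dominance of the top four can be quantified directly. The values below are copied from the table above; the variable names are illustrative.

```python
# Gini importances copied from the Decision Tree 2 table above
gini_importance = {
    'last_eval': 0.343958,
    '#_projects': 0.343385,
    'tenure': 0.215681,
    'overworked': 0.093498,
    'department_support': 0.001142,
    'salary': 0.000910,
    'department_sales': 0.000607,
    'department_technical': 0.000418,
    'work_accident': 0.000183,
    'department_IT': 0.000139,
    'department_marketing': 0.000078,
}

total = sum(gini_importance.values())
top4_share = sum(sorted(gini_importance.values(), reverse=True)[:4])

print(f"total importance  = {total:.6f}")
print(f"top four features = {top4_share:.6f}")
```

The four leading features account for more than 99% of the total importance in this tree.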

6.3 Random forest

# Get all rf CV scores
rf_results = pd.concat([rf1_cv_results,rf2_cv_results], axis=0)
rf_results.reset_index(drop=True, inplace=True)
rf_results.style.set_table_attributes("class='table table-sm'")
  model precision recall F1 accuracy auc
0 Random Forest 1 CV 0.950023 0.915614 0.932467 0.977983 0.980425
1 Random Forest 2 CV 0.866758 0.878754 0.872407 0.957411 0.964810
# Get all rf test scores
rf_test_results = pd.concat([rf1_test_scores, rf2_test_scores], axis=0)
rf_test_results.reset_index(drop=True, inplace=True)
rf_test_results.style.set_table_attributes("class='table table-sm'")
  model precision recall f1 accuracy AUC
0 Random Forest 1 Test 0.964211 0.919679 0.941418 0.980987 0.956439
1 Random Forest 2 Test 0.870406 0.903614 0.886700 0.961641 0.938407

The Random Forest 1 model consistently outperformed Random Forest 2 on both the cross-validation (CV) and test datasets. In cross-validation, Random Forest 1 achieved a higher F1 score (0.932 vs 0.872) and accuracy (0.978 vs 0.957). Similarly, on the test set, Random Forest 1 achieved a higher F1 score (0.941 vs 0.887) and accuracy (0.981 vs 0.962). The higher AUC values on both the CV and test sets further confirm the superior classification performance of Random Forest 1.

6.3.1 Feature importance

Now, plot the feature importance for the Random Forest 2 model.

# Get feature importances
feat_impt = rf2.best_estimator_.feature_importances_

# Get indices of top 10 features
ind = np.argpartition(rf2.best_estimator_.feature_importances_, -10)[-10:]

# Get column labels of top 10 features 
feat = X.columns[ind]

# Filter `feat_impt` to consist of top 10 feature importance
feat_impt = feat_impt[ind]

y_df = pd.DataFrame({"Feature":feat,"Importance":feat_impt})
y_sort_df = y_df.sort_values("Importance")
fig = plt.figure()
ax = fig.add_subplot(111)

y_sort_df.plot(kind='barh', ax=ax, x="Feature", y="Importance")

ax.set_title("Random Forest 2: Important variables that have an impact in employees leaving", fontsize = 14)
ax.set_ylabel("Feature")
ax.set_xlabel("Importance")
plt.show()

The feature importance plot for the Random Forest 2 model shows the same pattern as the feature importance plot for the Decision Tree model.

6.4 Comparing and Evaluating the Models

# Get all model scores
# get_scores names its columns 'f1' and 'AUC', while the other tables use 'F1' and 'auc';
# rename them so the test rows line up instead of producing NaN columns
test_scores = [df.rename(columns={'f1': 'F1', 'AUC': 'auc'}) for df in (rf1_test_scores, rf2_test_scores)]
all_models = [report_logr, dtree1_cv_results, dtree2_cv_results, rf1_cv_results, rf2_cv_results] + test_scores

model_results = pd.concat(all_models, axis=0)

model_results.reset_index(drop=True, inplace=True)
model_results.style.set_table_attributes("class='table table-sm'")
  model precision recall F1 accuracy auc
0 Logistic Regression 0.793086 0.825573 0.801372 0.825573 0.892291
1 Decision Tree 1 CV 0.914490 0.916279 0.915345 0.971867 0.969867
2 Decision Tree 2 CV 0.856693 0.903553 0.878882 0.958523 0.958675
3 Random Forest 1 CV 0.950023 0.915614 0.932467 0.977983 0.980425
4 Random Forest 2 CV 0.866758 0.878754 0.872407 0.957411 0.964810
5 Random Forest 1 Test 0.964211 0.919679 0.941418 0.980987 0.956439
6 Random Forest 2 Test 0.870406 0.903614 0.886700 0.961641 0.938407

7 Conclusion, Recommendations, Next Steps

The initial assessment, EDA, and visualizations suggest that employees are overworked, which points to poor management of workload. The models and their feature importances support this.

Following recommendations could be presented to the stakeholders for retaining the employees:

  • Limit or cap the number of projects that employees can work on.
  • Investigate the dissatisfaction of employees with four years of tenure.
  • Reward or provide benefits to employees who work longer hours.
  • Inform employees about the overtime work policies.
  • Define evaluation scores and measure them on a proportionate scale, so that employees who work hard or put in more effort are rewarded fairly.

Next Steps: Establish a structured method for collecting employees' evaluation and satisfaction scores before they leave the company, since scores recorded around the time of departure may lead to data leakage. This would help mitigate the issue and may improve the model's performance.